{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Meta-SGD\n",
    "\n",
    "Let us say we have some task $T$. We use a model  $f$ parameterized by some parameter  $\\theta$ and train the model to minimize the loss. We minimize the loss using gradient descent and find the optimal paramter $\\theta$ for the model. \n",
    "\n",
    "Let us recall the update rule of a gradient descent, \n",
    "\n",
    "$\\theta = \\theta - \\alpha  \\nabla_{\\theta} L_{T_i}(f_{\\theta}) $\n",
    "\n",
    "So What are all the key elements that make up our gradient descent?\n",
    "\n",
    "- Parameter $\\theta$\n",
    "- Learning rate $\\alpha$\n",
    "- Update direction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We usually set the parameter  $\\theta$ to some random value and try to find the optimal value during our training process and we set the value of learning rate $\\alpha$ to a small number or decay it over time and update direction follows the gradient. Can we learn all the key elements of gradient descent by meta learning? so that we can learn quickly from smaller datasets. We have already seen in the last chapter, how MAML, finds the better initialization parameter $\\theta$ that is generalizable across tasks. With the better initial parameter, we can take less gradient steps and learn quickly on a new task. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "So, now can we also learn the optimal learning rate and update direction that is generalizable across tasks so that it leads to faster convergence and we can train quickly? Let us see how can we learn this in meta-SGD by comparing it with MAML. If you recall, in MAML inner loop we find the optimal parameter $\\theta_i'$ for each task $T_i$ by minimizing the loss through gradient descent. \n",
    "\n",
    "i.e $\\theta'_i = \\theta - \\alpha  \\nabla_{\\theta} L_{T_i}(f_{\\theta}) $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For meta-SGD, We can rewrite the above equation as,\n",
    "\n",
    "i.e $ \\theta'_i = \\theta - \\alpha \\circ \\nabla_{\\theta} L_{T_i}(f_{\\theta}) $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "but what's the difference here? Here $\\alpha$ is not just a scalar small value but a vector. we initialize  $\\alpha$ randomly with same shape as $\\theta$ We call $\\theta$ as initial parameter and  $\\alpha \\circ \\nabla_{\\theta} L_{T_i}(f_{\\theta})$ as an adaptation term. So, the adaptation term  $\\alpha \\circ \\nabla_{\\theta} L_{T_i}(f_{\\theta})$ represents the update direction and its length becomes the learning rate. We update our values in  direction $\\alpha \\circ \\nabla_{\\theta} L_{T_i}(f_{\\theta})$ instead of gradient direction  $\\nabla_{\\theta} L_{T_i}(f_{\\theta}) $  and our learning rate is implicitly implemented in the adaptation term."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, in meta-SGD, we don't initialize a learning rate $\\alpha$ with some small scalar value. Instead, we initialize them with random values with same shape as $\\theta$ and learn them along with \\theta$. So, we sample some batch of tasks and for each task, we sample some k data points and minimize the loss using gradient descent but our update equation becomes, \n",
    "\n",
    "$ \\theta'_i = \\theta - \\alpha \\circ \\nabla_{\\theta} L_{T_i}(f_{\\theta}) $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "i.e our update direction is the adaptation term direction and not the gradient direction and also we learn  $\\alpha$ long with $\\theta$. \n",
    "\n",
    "Now in the outer loop, we perform meta optimization i.e we calculate gradients of loss with respect to optimal parameters $\\theta'_i$  and update our randomly initialized model parameter $\\theta$ . In meta SGD, instead of updating $\\theta$ alone, we also update our randomly initialized $\\alpha$ as,  \n",
    "\n",
    "\n",
    "$\\theta = \\theta - \\beta \\nabla_{\\theta}  \\sum_{T_i \\sim p(T)} L_{T_i} (f_{\\theta'_i}) $\n",
    "\n",
    "\n",
    "$\\alpha = \\alpha - \\beta \\nabla_{\\alpha}  \\sum_{T_i \\sim p(T)} L_{T_i} (f_{\\theta'_i}) $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you can notice Meta-SGD is just a small tweak over MAML. In MAML, we randomly initialize the model parameter $\\theta$ and try to find the optimal parameter that is generalizable across tasks. In meta-SGD instead of just learning the model parameter $\\theta$, we also learn the learning rate and update direction which is implicitly implemented in the adaptation term. \n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
